Bivariate Analysis¶
Importing Required libraries¶
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sb
Loading the dataset¶
data = pd.read_csv(r"C:\Users\DEEL\OneDrive\Documents\Clean_course_rec2.csv")
data
| | job_id | job_title | category | course_title | skills |
|---|---|---|---|---|---|
| 0 | 3900085113 | human resources manager bilingual spanish | hr | beginner to pro in powerpoint complete powerpo... | ai, c, erp, powerpoint, r, training |
| 1 | 3900085113 | human resources manager bilingual spanish | hr | how to create amazing cinemagraphs with micros... | ap, c, erp, microsoft, microsoft powerpoint, p... |
| 2 | 3900085113 | human resources manager bilingual spanish | hr | logo design in powerpoint | design, erp, powerpoint, r |
| 3 | 3900085113 | human resources manager bilingual spanish | hr | business card design in powerpoint | ar, c, ca, design, erp, powerpoint, r |
| 4 | 3900085113 | human resources manager bilingual spanish | hr | flat icon design in powerpoint | c, design, erp, powerpoint, r |
| ... | ... | ... | ... | ... | ... |
| 4303 | 3899516898 | business development manager | business development | create kindle ebook covers with powerpoint | c, erp, powerpoint, r |
| 4304 | 3899516898 | business development manager | business development | plantillas powerpoint para publicar en mercado... | ar, c, ca, erp, lan, powerpoint, r |
| 4305 | 3899516898 | business development manager | business development | basic graphic design for powerpoint | ap, c, design, erp, graphic design, phi, power... |
| 4306 | 3899516898 | business development manager | business development | how to design professional powerpoint business... | design, erp, powerpoint, presentation, r |
| 4307 | 3899516898 | business development manager | business development | self advertise using powerpoint twitter and fa... | c, erp, fa, ios, powerpoint, r |
4308 rows × 5 columns
Checking for Unique Values¶
data.nunique()
job_id            27
job_title         24
category           5
course_title    1946
skills           680
dtype: int64
Vectorizing the skills feature, creating a cosine similarity matrix, and plotting a heatmap of job title similarity based on skills¶
This section measures how similar different job titles are based on the skills they require.
It uses TF-IDF to convert the skills text into numerical vectors and cosine similarity to measure how close the job titles are to each other.
Finally, a heatmap is used to visualize the results.
Step-by-Step Explanation¶
Grouping skills by job title
- Collects all skills related to each job title.
- Joins multiple skills into a single string so every job has one combined skill set.
Converting skills into numerical features (TF-IDF)
- TF-IDF assigns weights to each skill:
- Skills that are common across many jobs get lower weight.
- Rare skills that appear in only a few jobs get higher weight.
- This helps highlight distinctive skills for each job.
Measuring similarity between job titles
- Cosine similarity is applied to the TF-IDF values.
- Produces a similarity score between 0 and 1:
- 1 → very similar job roles.
- 0 → completely different skills.
Creating a similarity matrix
- Builds a table where both rows and columns represent job titles.
- Each cell contains the similarity score between two job titles.
Visualizing with a heatmap
- Displays the similarity matrix as a colored grid.
- Brighter colors = higher similarity.
- Darker colors = lower similarity.
- Makes it easy to spot clusters of jobs with overlapping skills.
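To make the 0-to-1 range concrete, here is a minimal sketch that computes cosine similarity by hand for three jobs and compares it with scikit-learn's `cosine_similarity`. The weight values and the `cosine` helper are hypothetical and chosen only for illustration; they are not taken from the dataset.

```python
import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical TF-IDF-like weights for three jobs over four skill terms.
weights = np.array([
    [0.5, 0.5, 0.0, 0.0],  # job A
    [0.5, 0.5, 0.0, 0.0],  # job B: same skill profile as job A
    [0.0, 0.0, 0.7, 0.3],  # job C: shares no skills with job A
])

def cosine(u, v):
    """Cosine similarity: dot product divided by the product of the norms."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

sim_ab = cosine(weights[0], weights[1])  # identical profiles
sim_ac = cosine(weights[0], weights[2])  # no overlapping skills

# The library call returns the full pairwise matrix in one step.
sim_matrix = cosine_similarity(weights)
print(sim_ab, sim_ac)
print(sim_matrix.round(2))
```

Because TF-IDF weights are never negative, the score cannot drop below 0, which is why the scale runs from 0 to 1 rather than from -1 to 1.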
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity
job_skills = data.groupby('job_title')['skills'].apply(lambda x: ', '.join(x)).reset_index()
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(job_skills['skills'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
similarity_df = pd.DataFrame(cosine_sim, index=job_skills['job_title'], columns=job_skills['job_title'])
plt.figure(figsize=(12, 10))
sb.heatmap(similarity_df, cmap='viridis')
plt.title('Job Title Similarity based on Skills')
plt.xlabel('Job Title')
plt.ylabel('Job Title')
plt.show()
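One caveat worth checking: the cell above calls `TfidfVectorizer()` with default settings, which tokenizes on words of two or more characters. Single-letter skills such as `c` and `r` are therefore dropped, and multi-word skills such as `graphic design` are split into separate words. The sketch below shows the difference on toy strings; the `split_skills` tokenizer is a hypothetical alternative, not something the notebook uses.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy skill strings in the same comma-separated format as the `skills` column.
toy_skills = ["c, r, graphic design, powerpoint", "c, html, web design"]

# Default settings: one-character tokens are ignored and phrases are split.
default_vec = TfidfVectorizer()
default_vec.fit(toy_skills)
default_vocab = sorted(default_vec.vocabulary_)

def split_skills(text):
    """Hypothetical tokenizer: treat each comma-separated entry as one token."""
    return [part.strip() for part in text.split(",") if part.strip()]

# token_pattern=None tells scikit-learn that the custom tokenizer replaces it.
skill_vec = TfidfVectorizer(tokenizer=split_skills, token_pattern=None)
skill_vec.fit(toy_skills)
skill_vocab = sorted(skill_vec.vocabulary_)

print(default_vocab)
print(skill_vocab)
```

Whether whole-skill tokens are preferable depends on the goal: splitting phrases lets `graphic design` and `web design` share the word `design`, while whole-skill tokens keep short entries like `c` and `r` in the comparison.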
Box plot of the number of skills required in each category¶
This code calculates how many skills each job listing has and then visualizes the distribution of skill counts across job categories using a box plot.
Step-by-Step Explanation¶
Counting the number of skills per job
- Each job’s skills are stored as a comma-separated list.
- The number of commas is counted, and `+1` is added (since the number of items = commas + 1).
- A new column `skill_count` is created to store this value for each job.
Creating a box plot
- The x-axis represents different job categories.
- The y-axis represents the number of skills required for jobs in that category.
- Each box shows:
- Median line → the typical number of skills required.
- Box edges (Q1 and Q3) → the middle 50% of skill counts.
- Whiskers → the range of most data points.
- Outliers → jobs that require unusually high or low numbers of skills.
Improving visualization
- The plot size is increased for clarity.
- Category names on the x-axis are rotated to prevent overlap.
data['skill_count'] = data['skills'].str.count(',') + 1
# Plotting the box plot
plt.figure(figsize=(12, 7))
sb.boxplot(x='category', y='skill_count', data=data)
plt.title('Distribution of Skill Count per Job Category')
plt.xlabel('Job Category')
plt.ylabel('Number of Skills')
plt.xticks(rotation=45) # Rotate category names if they overlap
plt.show()
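The comma-count approach assumes every comma separates two non-empty skills. As a rough cross-check, the sketch below compares it with a stricter count that splits the string and ignores empty entries. The toy rows and the `skill_count_strict` column are hypothetical and only illustrate where the two counts can diverge.

```python
import pandas as pd

# Toy rows; the third one contains an empty entry between two commas.
toy = pd.DataFrame(
    {"skills": ["c, erp, powerpoint, r", "design", "ai, c, , training"]}
)

# Comma-count approach, as used in the cell above.
toy["skill_count"] = toy["skills"].str.count(",") + 1

# Hypothetical stricter count: split on commas and skip blank entries.
toy["skill_count_strict"] = toy["skills"].apply(
    lambda s: len([part for part in s.split(",") if part.strip()])
)

print(toy)
```

If the two columns agree on the real data, the simpler comma count is sufficient.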
Validating the outliers in finance category courses¶
Step-by-Step Explanation¶
Filter the Finance category
- Select only rows where the job category is `finance`.
- This narrows down the dataset to just finance-related courses/jobs.
Calculate Q1, Q3, and IQR
- Q1 (25th percentile): The value below which 25% of the data falls.
- Q3 (75th percentile): The value below which 75% of the data falls.
- IQR (Interquartile Range): The difference `Q3 - Q1`, showing the middle spread of the data.
Determine the outlier threshold
- Outliers are defined as values greater than `Q3 + 1.5 * IQR`.
- This is a standard statistical rule for identifying unusually high data points.
Filter out the outliers
- Identify all finance jobs/courses where the skill count is above this threshold.
- These rows are potential outliers.
Display the results
- Show relevant details (`course_title`, `skills`, `skill_count`) for the outlier rows.
- Sorting them by `skill_count` helps in examining the most extreme cases.
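As a worked check of the threshold arithmetic, the sketch below plugs in the quartiles this notebook reports for the finance category (Q1 = 3.0, Q3 = 5.0). The `iqr_fences` helper is a hypothetical name introduced here; it also computes the lower fence, which the notebook does not use.

```python
def iqr_fences(q1, q3, k=1.5):
    """Return the (lower, upper) outlier fences for the k * IQR rule."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Quartiles reported for the finance category in this notebook.
lower, upper = iqr_fences(3.0, 5.0)

print(f"Lower fence: {lower}")  # 3.0 - 1.5 * 2.0
print(f"Upper fence: {upper}")  # 5.0 + 1.5 * 2.0
```

With these quartiles the lower fence is 0.0, and every row has at least one skill, so only the upper fence can flag anything. That is consistent with checking only values above the threshold.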
Why We Are Doing This¶
The box plot revealed that the Finance category has many outliers in terms of required skills.
We wanted to check if these outliers were due to:
- Repeated skills being listed multiple times, or
- Legitimately high skill requirements.
After inspection, we found that the skills listed were legitimate, meaning that finance courses/jobs often demand significantly more skills than typical categories.
Purpose of the Code¶
- To validate whether outliers in the Finance category are data quality issues or true skill requirements.
- Helps confirm that the Finance sector legitimately requires a broader and deeper set of skills compared to other categories.
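The first possibility, repeated skills inflating the count, can also be checked programmatically rather than by eye. The following is a minimal sketch on toy rows; the `has_repeated_skills` helper and the `has_repeats` column are hypothetical names, and the same function could be applied to the `skills` column of the finance outliers.

```python
import pandas as pd

# Toy rows: the second one lists "c" twice.
toy = pd.DataFrame({"skills": ["ap, c, ca, canva", "c, erp, c, powerpoint"]})

def has_repeated_skills(skill_string):
    """Return True if any comma-separated skill appears more than once."""
    parts = [part.strip() for part in skill_string.split(",") if part.strip()]
    return len(parts) != len(set(parts))

toy["has_repeats"] = toy["skills"].apply(has_repeated_skills)
print(toy)
```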
finance_data = data[data['category'] == 'finance']
Q1 = finance_data['skill_count'].quantile(0.25)
Q3 = finance_data['skill_count'].quantile(0.75)
IQR = Q3 - Q1
outlier_threshold = Q3 + 1.5 * IQR
print(f"Finance Category Q1: {Q1}")
print(f"Finance Category Q3: {Q3}")
print(f"Finance Category IQR: {IQR}")
print(f"Anything above {outlier_threshold:.2f} skills is an outlier.")
# Filter the DataFrame to find and display the outliers
finance_outliers = finance_data[finance_data['skill_count'] > outlier_threshold ]
# Display the interesting columns for these outlier rows
print("\n--- Outliers in the Finance Category ---")
finance_outliers[['course_title', 'skills','skill_count']].sort_values(by='skill_count')
Finance Category Q1: 3.0
Finance Category Q3: 5.0
Finance Category IQR: 2.0
Anything above 8.00 skills is an outlier.

--- Outliers in the Finance Category ---
| | course_title | skills | skill_count |
|---|---|---|---|
| 2429 | build enterprise applications with angular 2 a... | ap, applications, ar, c, ca, enterprise applic... | 9 |
| 2910 | how to make it work successfully in capital ma... | ap, api, ar, c, ca, capital markets, make, r, sf | 9 |
| 2922 | canva graphics design essential training for e... | ai, ap, c, ca, canva, design, phi, r, training | 9 |
| 2939 | canva graphic design theory volume1 | ap, c, ca, canva, design, graphic design, heor... | 9 |
| 3039 | graphic design double your sales with canva | ap, c, ca, canva, design, graphic design, phi,... | 9 |
| 3119 | canva graphic design theory volume2 | ap, c, ca, canva, design, graphic design, heor... | 9 |
| 3682 | build enterprise applications with angular 2 a... | ap, applications, ar, c, ca, enterprise applic... | 9 |
| 4179 | learn facebook flux architecture for web appli... | ap, applications, ar, c, ca, fa, pp, r, web ap... | 9 |
| 3730 | ruby on rails training and skills to build web... | ai, ap, applications, c, ca, pp, r, training, ... | 9 |
| 3769 | start web development with gis map in javascript | ap, ar, c, development, java, javascript, pm, ... | 9 |
| 3806 | master electron desktop apps using html, javas... | ap, c, css, html, java, javascript, ml, pp, r | 9 |
| 3864 | web application development using redis, expre... | ap, application development, c, ca, developmen... | 9 |
| 3987 | html5 and css3 learn web design with html css ... | ap, ar, c, css, design, html, ml, r, web design | 9 |
| 4132 | javascript promises applications in es6 and an... | ap, applications, ar, c, ca, java, javascript,... | 9 |
| 4175 | web application development learn by building ... | ap, application development, ar, c, ca, develo... | 9 |
| 3740 | servlets and jsps tutorial learn web applicati... | ap, applications, ar, c, ca, java, pp, r, web ... | 9 |
| 2780 | sensitivity scenario analysis for ca cfa cpa e... | ar, c, ca, cfa, cpa, fa, r, scenario analysis,... | 9 |
| 2800 | financial management capital market instruments | ap, api, ar, c, ca, cia, financial management,... | 9 |
| 2530 | web application development learn by building ... | ap, application development, ar, c, ca, develo... | 9 |
| 2441 | ruby on rails training and skills to build web... | ai, ap, applications, c, ca, pp, r, training, ... | 9 |
| 2443 | servlets and jsps tutorial learn web applicati... | ap, applications, ar, c, ca, java, pp, r, web ... | 9 |
| 2456 | master electron desktop apps using html, javas... | ap, c, css, html, java, javascript, ml, pp, r | 9 |
| 2464 | web application development using redis, expre... | ap, application development, c, ca, developmen... | 9 |
| 2520 | javascript promises applications in es6 and an... | ap, applications, ar, c, ca, java, javascript,... | 9 |
| 2532 | learn facebook flux architecture for web appli... | ap, applications, ar, c, ca, fa, pp, r, web ap... | 9 |
| 4198 | learn web development by creating a social net... | ar, c, cia, development, net, network, pm, r, ... | 9 |
| 3856 | build your own calculator app with javascript,... | ap, c, ca, css, html, java, javascript, ml, pp, r | 10 |
| 4176 | rails ecommerce app with html template from th... | ai, ap, c, ecommerce, html, mef, ml, pp, r, rest | 10 |
| 2653 | risk analysis capital budgeting for ca cs cfa ... | ap, api, budgeting, c, ca, capital budgeting, ... | 10 |
| 2531 | rails ecommerce app with html template from th... | ai, ap, c, ecommerce, html, mef, ml, pp, r, rest | 10 |
| 2543 | fintech and the transformation in financial se... | banking, blockchain, business transformation, ... | 10 |
| 4049 | learn html css how to start your web developme... | ar, c, ca, css, development, html, ml, pm, r, ... | 10 |
| 3784 | web development with html css bootstrap jquery... | ap, c, css, development, html, jquery, ml, pm,... | 10 |
| 2833 | school of raising capital agile financial mode... | agile, ai, ap, api, c, ca, cia, financial mode... | 10 |
| 2463 | build your own calculator app with javascript,... | ap, c, ca, css, html, java, javascript, ml, pp, r | 10 |
| 2652 | working capital management for ca cfa cpa exams | ap, api, c, ca, capital management, cfa, cpa, ... | 11 |
| 2544 | supply chain finance and blockchain technology | accounts payable and receivable, blockchain, c... | 14 |
| 2539 | intuit academy bookkeeping | accounting, accounting software, accounts paya... | 14 |
| 2542 | digital transformation in financial services | banking, blockchain, business analysis, busine... | 18 |
| 2538 | introduction to finance and accounting | account management, accounting, accounts payab... | 21 |
| 2541 | financial management | account management, accounting, accounts payab... | 23 |
| 2540 | business and financial modeling | accounting, basic descriptive statistics, busi... | 28 |
| 2537 | business foundations | account management, accounting, accounts payab... | 36 |
Plotting a scatter plot to analyze the relationship between skills and courses¶
This code places courses in a 2D space according to the similarity of their required skills, using Principal Component Analysis (PCA), and displays them in an interactive Plotly scatter plot. No clustering algorithm is run; groups are identified visually from courses that land close together.
Step-by-Step Explanation¶
Group skills by course
- Combines all skills listed under each `course_title` into a single string.
- Ensures that each course has one consolidated skill set to analyze.
Convert skills into numerical features (TF-IDF)
- `TfidfVectorizer` converts skill text into numerical vectors.
- Assigns higher weights to unique skills and lower weights to common ones.
- This allows us to represent each course in terms of distinctive skills.
Compute similarity between courses
- `cosine_similarity` measures how close two courses are based on skills.
- Produces a similarity matrix where values range from:
- 1 → very similar courses (skills overlap a lot).
- 0 → very different courses (skills do not overlap).
Dimensionality reduction using PCA
- Cosine similarity produces a high-dimensional matrix.
- PCA reduces this data to 2 dimensions (PCA1, PCA2) for visualization.
- Courses with similar skills will be positioned closer together in this reduced space.
Create a DataFrame for visualization
- Stores the PCA results in a DataFrame.
- Adds the corresponding course titles for reference.
Visualize with an interactive scatter plot (Plotly)
- Each point represents a course.
- Courses closer to each other indicate similar skill sets.
- Hovering over points shows the course title.
- Makes it easy to explore clusters and identify related courses interactively.
Purpose of the Code¶
- To visualize how courses group together based on their skills.
- Helps answer questions like:
- Which courses have overlapping skill requirements?
- Are there clear clusters of courses focusing on similar topics?
- Can we spot outlier courses that require very unique skills?
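A 2D projection discards information, so before reading much into the scatter plot it can help to check how much variance the two components retain. The sketch below runs the same TF-IDF, cosine similarity, and PCA pipeline on four made-up courses and inspects `explained_variance_ratio_`. The course strings are hypothetical and not taken from the dataset.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical skill strings: two web courses and two finance courses.
toy_courses = [
    "html, css, javascript",
    "html, css, web design",
    "accounting, bookkeeping, finance",
    "accounting, financial modeling, finance",
]

tfidf = TfidfVectorizer().fit_transform(toy_courses)
sim = cosine_similarity(tfidf)  # 4 x 4 similarity matrix

pca = PCA(n_components=2)
coords = pca.fit_transform(sim)  # one (PCA1, PCA2) point per course

# Share of the total variance kept by each of the two components.
explained = pca.explained_variance_ratio_
print(explained, explained.sum())

# Courses with overlapping skills should land closer together.
dist_web_pair = np.linalg.norm(coords[0] - coords[1])
dist_web_vs_finance = np.linalg.norm(coords[0] - coords[2])
print(dist_web_pair, dist_web_vs_finance)
```

On the real data, a low combined ratio would suggest that points appearing close in the plot are not necessarily close in the full similarity space.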
!pip install plotly
import plotly.express as px
from sklearn.decomposition import PCA
# Each `skills` value is already a comma-separated string, so a plain join is enough
course_skills = data.groupby('course_title')['skills'].apply(', '.join).reset_index()
vectorizer = TfidfVectorizer()
tfidf_matrix = vectorizer.fit_transform(course_skills['skills'])
cosine_sim = cosine_similarity(tfidf_matrix, tfidf_matrix)
pca = PCA(n_components=2)
pca_result = pca.fit_transform(cosine_sim)
pca_df = pd.DataFrame(pca_result, columns=['PCA1', 'PCA2'])
pca_df['course_title'] = course_skills['course_title']
fig = px.scatter(pca_df, x='PCA1', y='PCA2',
hover_data=['course_title'],
title='Course Clustering by Skills')
fig.show()